docs: add operations documentation guides #309

Open
WentingWu666666 wants to merge 18 commits into documentdb:main from WentingWu666666:wentingwu/issue-253-operations-docs

WentingWu666666 (Collaborator) commented Mar 12, 2026

This PR adds 5 operations documentation guides for the DocumentDB Kubernetes Operator, covering day-to-day cluster management tasks.

New Documentation

| Guide | Description |
|-------|-------------|
| Failover | Automatic local replica promotion, cross-cluster failover for multi-region, and application connection considerations |
| Upgrades | Operator Helm chart and CRD upgrades, per-cluster DocumentDB extension and gateway image updates, and rollback procedures |
| Backup & Restore | On-demand and scheduled VolumeSnapshot backups, restore from backup, and retention policy configuration |
| Restore Deleted Cluster | Recovery via VolumeSnapshot backup restore or retained PersistentVolume reattachment |
| Maintenance | Cluster health monitoring, PostgreSQL and gateway log review, resource usage tracking, and Kubernetes events/alerts |

Verification

Every command, event name, label, container name, and path in these docs was verified against:

  • Source code audit of the operator controllers
  • Live testing in a local Kind cluster (Kubernetes v1.35.0)

Key decisions

  • Scaling doc moved to a separate branch (wentingwu/scaling-docs): it is blocked on issue #306, "Reconciliation loop does not propagate spec changes to existing CNPG clusters"
  • CRD upgrade step uses --server-side --force-conflicts: plain kubectl apply fails for the large dbs.documentdb.io CRD
  • CRD upgrade placed before helm upgrade: this ensures new CRD fields are available when the operator starts
  • No spec.resources references: the DocumentDB CRD only has spec.resource.storage, not CPU/memory limits
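
The ordering described in these decisions can be sketched as a short script. This is a minimal sketch, not the procedure from the guide itself: the CRD manifest paths in `CRD_FILES`, the `documentdb-operator` namespace, and the `DRY_RUN` switch are assumptions made for illustration, while the release and chart names come from the `helm upgrade` line quoted later in this review. By default it only prints the commands it would run.

```bash
# Sketch: apply the CRDs server-side first, then upgrade the operator chart.
set -eu

DRY_RUN="${DRY_RUN:-1}"                        # assumed helper switch: 1 = print only
NAMESPACE="${NAMESPACE:-documentdb-operator}"  # assumed operator namespace
# Hypothetical local paths for the three CRD manifests (dbs, backups, scheduledbackups).
CRD_FILES="${CRD_FILES:-crds/dbs.yaml crds/backups.yaml crds/scheduledbackups.yaml}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

upgrade_operator() {
  # 1. CRDs first, so new fields exist before the new operator starts.
  #    Server-side apply avoids the client-side annotation size limit.
  for crd in $CRD_FILES; do
    run kubectl apply --server-side --force-conflicts -f "$crd"
  done
  # 2. Then the operator release itself.
  run helm upgrade documentdb-operator documentdb/documentdb-operator \
    --namespace "$NAMESPACE"
}

upgrade_operator
```

Setting `DRY_RUN=0` would execute the commands instead of printing them; check the manifest paths and namespace against your own installation first.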

Also includes

  • Updated CONTRIBUTING.md with MkDocs documentation testing instructions
  • Updated mkdocs.yml navigation (removed scaling.md)

Closes #253

WentingWu666666 changed the title from "docs: add operations documentation (failover, maintenance, upgrades, backup, restore)" to "docs: add operations documentation guides" on Mar 18, 2026
WentingWu666666 marked this pull request as ready for review March 18, 2026 18:16
Copilot AI review requested due to automatic review settings March 18, 2026 18:16
wentingwu000 and others added 14 commits March 18, 2026 14:20
Add six new operations guides covering day-2 cluster management:

- backup-and-restore: conceptual overview, on-demand/scheduled backups,
  restore workflow, retention policy, and troubleshooting
- scaling: vertical scaling (instancesPerNode 1-3) and PVC storage
  expansion with prerequisites and monitoring
- upgrades: operator, extension, and gateway upgrade procedures,
  rolling update behavior, and rollback protection
- failover: local automatic and cross-cluster manual failover, testing
  procedures, and application connection considerations
- restore-deleted-cluster: recovery from backup or retained PV,
  verification steps, and common pitfalls
- maintenance: monitoring, log management, resource tuning, node
  maintenance, rolling restarts, and routine checklists

Update mkdocs.yml with new Operations navigation section.

Refs documentdb#253

Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

- Add YAML front matter (title, description, tags) to all 6 operations docs
- Rewrite Overview sections: what the operation is + why it matters
- Disambiguate all bare 'cluster' to 'DocumentDB cluster' or 'Kubernetes cluster'
- Disambiguate 'operator' to 'DocumentDB operator' in upgrades doc
- backup-and-restore: add CSI link, multi-region section, tabbed prerequisites,
  YAML block titles, restore constraints, cross-ref to networking for mongosh
- restore-deleted-cluster: route Method 1 to backup-and-restore, remove internal
  details section, add YAML title, cross-ref to networking for verify step
- scaling: replace unsupported storage expansion with link to storage config
- upgrades: remove unnecessary backup step from operator upgrade, replace heredoc
  with YAML block, use placeholder versions instead of fake 1.2.0
- maintenance: fix broken link to removed storage-expansion anchor
- Update configuration front matter descriptions to match actual content

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

Replace "CNPG monitors/promotes/triggers" with "the operator
monitors/promotes/triggers" in prose explanations across all operations
docs (failover, maintenance, scaling, upgrades, backup-and-restore).
Resource names like clusters.postgresql.cnpg.io are preserved in
kubectl commands that users need to run.

Also restructures several sections into Material for MkDocs tabbed
format for improved readability and fixes the troubleshooting namespace
reference from cnpg-system to documentdb-operator.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

Refine all operations documentation based on review feedback:

- scaling.md: mirror structure for scale up/down tabs, fix "at least 2"
  for failover, remove unnecessary checklist
- failover.md: fix networking cross-reference anchor, remove false
  connection pooling/quorum claims, fix replica read claim
- upgrades.md: merge extension+gateway into single component upgrade
  (documentDBVersion upgrades both), move pre-upgrade checklist under
  component upgrades, simplify overview table, remove cluster health
  check from operator verify
- backup-and-restore.md: convert on-demand/scheduled to tabs with API
  refs, fix CSI prerequisite wording, add YAML title, update retention
  policy to table format, improve backup identification step
- maintenance.md: clarify logLevel scope (PostgreSQL only), remove fake
  resource allocation table, add PVC resize planned note, clarify
  cordon terminology
- restore-deleted-cluster.md: fix broken anchor references
- mkdocs.yml: reorder nav (failover before upgrades)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

- failover: fix misleading write-only downtime claim to cover both reads
  and writes, add playground links for cross-cluster failover, explain
  instancesPerNode >= 2 requirement explicitly, merge behavior sections
- maintenance: add normal/investigate guidance for each maintenance task
  so users know what to expect and when to troubleshoot
- upgrades: add rollback sections with schema version check guidance
  (rollback if schema not upgraded, otherwise restore from backup)

All failover doc claims verified against source code and tested in Kind
cluster (3-instance cluster, primary deletion triggers automatic
failover with data preservation confirmed).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

Scaling operations (instancesPerNode, pvcSize changes) do not propagate
to existing CNPG clusters due to the reconciliation loop gap documented
in issue documentdb#306. Moving scaling doc to a separate branch until the
operator bug is fixed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

Upgrade doc fixes verified against source code and Kind cluster:
- Fix downgrade behavior: operator skips schema migration but still
  updates images (not 'rejects the change')
- Fix rolling update: primaryUpdateMethod=restart means primary is
  restarted in place (no switchover)
- Fix health check: operator checks primary pod health, not all pods
- Fix CRD handling: Helm crds/ dir only applies on install, not upgrade
- Remove misleading 'zero-downtime' from description

Maintenance doc cleanup:
- Remove CNPG-internal Advanced Diagnostics section
- Remove troubleshooting section with CNPG-specific commands
- Remove broken link to scaling doc (moved to separate branch)
- Reorganize Routine Checks section placement

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

- Fix CRD URL from microsoft/ to documentdb/ GitHub org
- List all 3 CRDs (dbs, backups, scheduledbackups) instead of just 1
- Fix image override examples to use correct repo path:
  ghcr.io/documentdb/documentdb-kubernetes-operator/documentdb
  ghcr.io/documentdb/documentdb-kubernetes-operator/gateway

All 24 claims in the upgrade doc verified against source code
and local Kind cluster.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

- Fix backup status from 'Succeeded' to 'completed' (actual phase value)
- Add missing metadata.name field to on-demand backup YAML example
- Apply same status fix in restore-deleted-cluster doc

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

…enance doc

The DocumentDB CRD has spec.resource.storage (for PVC config) but no
spec.resources.limits for CPU/memory. Replace with generic guidance
based on kubectl top output.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

The DocumentDB CRD (dbs.documentdb.io) exceeds the annotation size
limit for client-side kubectl apply, causing 'metadata.resourceVersion:
Invalid value: 0' errors. Switch to --server-side --force-conflicts
which avoids this limitation.

Verified in Kind cluster: CRD apply, helm upgrade (test->dev),
and helm rollback all tested successfully with zero DocumentDB
cluster disruption.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

- Remove non-existent INSTANCES column from kubectl get documentdb table
- Fix pod label selector from documentdb.io/cluster to app=<cluster-name>
- Fix PG log path from postgresql.log to /controller/log/postgres
- Fix gateway container name from gateway to documentdb-gateway
- Replace non-existent BackupSucceeded event with real BackupSchedule event
- Replace non-existent FailoverCompleted event with real InvalidSchedule event
- Fix PVRetained event name to PVsRetained (plural, matches source code)

All fixes verified against Kind cluster and operator source code.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

The scaling operations doc was moved to the wentingwu/scaling-docs branch
pending resolution of issue documentdb#306. Remove the nav entry to avoid a broken
link in the docs build.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

Update the YAML description field in each operations doc so it
accurately summarises the sections in that file.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

Contributor

Copilot AI left a comment

Pull request overview

This PR expands the public “Preview” documentation by adding a new Operations section (failover, upgrades, backup/restore, restore-deleted-cluster, maintenance) and updates several existing configuration page descriptions for clarity.

Changes:

  • Adds new Operations documentation pages under docs/operator-public-documentation/preview/operations/.
  • Updates mkdocs.yml navigation to surface the new Operations section.
  • Refines YAML frontmatter description text for networking, TLS, and storage configuration docs.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| mkdocs.yml | Adds “Operations” nav entries to expose new operational guides (but currently also keeps an existing “Backup and Restore” entry at the same level). |
| docs/operator-public-documentation/preview/operations/upgrades.md | New guide describing operator vs component upgrades and rollback considerations. |
| docs/operator-public-documentation/preview/operations/failover.md | New failover guide covering local and multi-region/cross-cluster promotion. |
| docs/operator-public-documentation/preview/operations/backup-and-restore.md | New backup/restore guide using VolumeSnapshots and Backup/ScheduledBackup CRs. |
| docs/operator-public-documentation/preview/operations/restore-deleted-cluster.md | New recovery guide describing restore via Backup or retained PVs. |
| docs/operator-public-documentation/preview/operations/maintenance.md | New maintenance guide covering health checks, logs, resource monitoring, and events. |
| docs/operator-public-documentation/preview/configuration/tls.md | Updates page description to better reflect supported TLS modes and content. |
| docs/operator-public-documentation/preview/configuration/storage.md | Updates page description to remove unsupported “volume expansion” claim. |
| docs/operator-public-documentation/preview/configuration/networking.md | Updates page description to highlight mongosh connection and Service types. |

WentingWu666666 force-pushed the wentingwu/issue-253-operations-docs branch from cf17f84 to 39f59a0 on March 18, 2026 18:22
wentingwu000 and others added 2 commits March 18, 2026 14:24
- Add missing metadata.name to ScheduledBackup example
- Fix GitHub org in failover cross-links (microsoft -> documentdb)
- Remove duplicate top-level 'Backup and Restore' nav entry from mkdocs.yml

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

The content is now covered by the Operations section:
- operations/backup-and-restore.md (backup, restore, retention)
- operations/restore-deleted-cluster.md (PV recovery)

Update cross-references in faq.md and storage.md to point to new paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

List backups for your DocumentDB cluster and choose one in `completed` status:

```bash
kubectl get backups -n default
```
Collaborator

Nit: is it in the default namespace?
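
One way to avoid hardcoding the namespace is to pass it as an argument and filter on the backup phase. The sketch below is illustrative only: it assumes the phase is exposed at `.status.phase` on the Backup resource (the commit notes above only say the phase value is `completed`), and the function names are hypothetical.

```bash
# Sketch: list Backup resources in a given namespace and keep the completed ones.
set -eu

# Print "name<TAB>phase" for every Backup in the namespace passed as $1.
# Assumes the phase lives at .status.phase; needs access to a cluster.
list_backups() {
  kubectl get backups -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
}

# Read "name<TAB>phase" lines on stdin and print the names whose phase is "completed".
completed_backups() {
  awk -F '\t' '$2 == "completed" { print $1 }'
}

# Usage (requires cluster access):
#   list_backups <namespace> | completed_backups
```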

### Step 4: Upgrade the DocumentDB Operator

```bash
helm upgrade documentdb-operator documentdb/documentdb-operator \
```
Collaborator

Should we add `--skip-crds` to `helm upgrade`, since we upgraded the CRDs manually above?

### Rollback and Recovery

Collaborator

For automatic rollback, I think we can use `helm upgrade my-release my-chart --atomic`?
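
A minimal sketch of this suggestion, assuming Helm 3's `--atomic` and `--timeout` flags. The namespace, the five-minute timeout, and the `DRY_RUN` switch are illustrative assumptions; the release and chart names come from the Step 4 snippet quoted earlier in this review. By default it only prints the commands.

```bash
# Sketch: operator upgrade that rolls back automatically on failure.
set -eu

DRY_RUN="${DRY_RUN:-1}"                        # assumed helper switch: 1 = print only
NAMESPACE="${NAMESPACE:-documentdb-operator}"  # assumed operator namespace
TIMEOUT="${TIMEOUT:-5m}"                       # illustrative timeout

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

atomic_upgrade() {
  # --atomic rolls the release back if the upgrade fails or times out.
  run helm upgrade documentdb-operator documentdb/documentdb-operator \
    --namespace "$NAMESPACE" --atomic --timeout "$TIMEOUT"
  # Inspect the revision history to confirm which revision is deployed.
  run helm history documentdb-operator --namespace "$NAMESPACE"
}

atomic_upgrade
```

A Helm rollback only reverts resources that Helm manages for the release. CRDs applied separately with `kubectl` stay at whatever version was applied.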

```yaml
spec:
  gatewayImage: "ghcr.io/documentdb/documentdb-kubernetes-operator/gateway:<version>"
```

Collaborator

Should we talk about DocumentDB cluster updates? Once the operator or schema updates are done, we want to migrate the cluster to newer versions.


Backups protect your DocumentDB cluster against data loss from accidental deletion, corruption, or failed upgrades. A reliable backup strategy is the foundation of any production deployment — without it, recovery may be impossible.

The DocumentDB operator provides a snapshot-based backup system built on Kubernetes [VolumeSnapshots](https://kubernetes.io/docs/concepts/storage/volume-snapshots/). Each backup captures a point-in-time copy of the primary instance's persistent volume, which can later be used to bootstrap a new DocumentDB cluster.
Collaborator

"Point-in-time" might not be the best wording, since it sounds like point-in-time restore. At a minimum, explain that data accumulated after a backup and before a crash might be lost.


## Prerequisites

Before creating backups, ensure your Kubernetes cluster has the required snapshot infrastructure.
Collaborator

s/infrastructure/support/g

## Local Automatic Failover

Local automatic failover requires at least two instances (`spec.instancesPerNode >= 2`). With a single instance, there is only the primary and no replica available to promote — so failover is not possible. When multiple instances are running, the operator automatically promotes a replica to primary if the current primary becomes unavailable.

Collaborator

We recommend matching the number of local replicas to the number of availability zones.
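
The prerequisite in the quoted paragraph (`spec.instancesPerNode >= 2`) can be checked before relying on local failover. This is a sketch under stated assumptions: the helper names are hypothetical, and it assumes the resource can be queried as `kubectl get documentdb`, as the commit notes above suggest.

```bash
# Sketch: check whether a DocumentDB cluster has a replica available to promote.
set -eu

# Succeeds when the instance count allows local failover
# (one primary plus at least one replica).
can_failover_locally() {
  [ "$1" -ge 2 ]
}

# Read spec.instancesPerNode from a DocumentDB resource.
# $1 = resource name, $2 = namespace; needs access to a cluster.
instances_per_node() {
  kubectl get documentdb "$1" -n "$2" -o jsonpath='{.spec.instancesPerNode}'
}

# Usage (requires cluster access):
#   n="$(instances_per_node <cluster-name> <namespace>)"
#   if can_failover_locally "$n"; then echo "local failover possible"; fi
```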


In a multi-region setup:

- One DocumentDB cluster is designated as the **primary** and handles all writes.
Collaborator

The primary can be set up as an "HA cluster", with replicas providing local HA, so that a failover to another region is only necessary under extraordinary circumstances.


## Log Management

=== "DocumentDB Operator Logs"
Collaborator

We recommend setting up centralized log collection as part of your observability strategy (see the observability chapter).


```yaml
spec:
  logLevel: "info" # Options: debug, info, warning, error
```
Collaborator

Why do we default to info? In prod it should run at warning or error?

### Step 2: Review Available Versions

Collaborator

Note: per release policy (see ...) we only support ...


| Upgrade Type | What Changes | How to Trigger |
|-------------|-------------|----------------|
| **DocumentDB operator** | The Kubernetes operator itself | Helm chart upgrade |
Collaborator

Please also specify that we upgrade CNPG for you. Is there a way to skip that?


## Component Upgrades

Updating `spec.documentDBVersion` upgrades **both** the DocumentDB extension and the gateway together, since they share the same version.
Collaborator

We should probably explain how we ensure that everything is deployed before we upgrade the schema in multi-region. The statement you wrote is confusing because it implies the schema gets updated automatically, which we don't want in multi-region.

1. You update the `spec.documentDBVersion` field.
2. The operator detects the version change and updates both the database image and the gateway sidecar image.
3. The underlying cluster manager performs a **rolling restart**: replicas are restarted first one at a time, then the **primary is restarted in place**. Expect a brief period of downtime while the primary pod restarts.
4. After the primary pod is healthy, the operator runs `ALTER EXTENSION documentdb UPDATE` to update the database schema.
Collaborator

What about in multi-region?
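
For a single DocumentDB cluster, step 1 of the quoted list could be triggered with a merge patch. This is a hedged sketch: the helper names and the `DRY_RUN` switch are hypothetical, and the `app=<cluster-name>` pod label is taken from the commit notes above. It does not cover the multi-region ordering raised in this thread.

```bash
# Sketch: bump spec.documentDBVersion on one DocumentDB cluster and list its pods.
set -eu

DRY_RUN="${DRY_RUN:-1}"  # assumed helper switch: 1 = print only

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Build the JSON merge patch that sets spec.documentDBVersion to $1.
version_patch() {
  printf '{"spec":{"documentDBVersion":"%s"}}' "$1"
}

# $1 = resource name, $2 = namespace, $3 = target version
upgrade_components() {
  name="$1"; namespace="$2"; version="$3"
  run kubectl patch documentdb "$name" -n "$namespace" \
    --type merge -p "$(version_patch "$version")"
  # List the cluster's pods to follow the rolling restart.
  run kubectl get pods -n "$namespace" -l "app=$name"
}

# Usage:
#   upgrade_components <cluster-name> <namespace> <version>
```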

wentingwu000 and others added 2 commits March 19, 2026 10:24
backup-and-restore.md:
- Replace 'point-in-time copy' with 'crash-consistent snapshot' and
  clarify that PITR is not supported (data loss between snapshot and
  failure)
- s/infrastructure/support/ in prerequisites
- Use <namespace> placeholder instead of hardcoded 'default'

failover.md:
- Add tip: match instancesPerNode to number of availability zones
- Clarify that primary cluster can itself be multi-instance HA,
  reducing need for cross-region failover

maintenance.md:
- Add centralized log collection recommendation with link to
  telemetry playground
- Change logLevel example to 'warning' and add production tip

upgrades.md:
- Document that CloudNative-PG is bundled and upgraded automatically
- Add release strategy support window note
- Add --skip-crds to helm upgrade (CRDs are applied manually)
- Add --atomic tip for automatic rollback
- Add cross-link from operator upgrade to component upgrades
- Document multi-region upgrade order (standbys first, primary last)
- Document multi-region schema migration behavior (primary-only)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

helm upgrade does not touch CRDs at all (per Helm docs), so
--skip-crds is a no-op and misleading.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Wenting Wu <wentingwu@microsoft.com>

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[DOCS] Operations documentation

5 participants